Liquid: Unifying Nearline and Offline Big Data Integration

Authors

Raul Castro Fernandez

Peter R. Pietzuch

Jay Kreps

Neha Narkhede

Jun Rao

Joel Koshy

Dong Lin

Chris Riccomini

Guozhang Wang

Abstract

With more sophisticated data-parallel processing systems, the new bottleneck in data-intensive companies shifts from the back-end data systems to the data integration stack, which is responsible for the pre-processing of data for back-end applications. The use of back-end data systems with different access latencies and data integration requirements poses new challenges that current data integration stacks based on distributed file systems—proposed a decade ago for batch-oriented processing—cannot address. In this paper, we describe Liquid, a data integration stack that provides low latency data access to support near real-time in addition to batch applications. It supports incremental processing, and is cost-efficient and highly available. Liquid has two layers: a processing layer based on a stateful stream processing model, and a messaging layer with a highly-available publish/subscribe system. We report our experience of a Liquid deployment with backend data systems at LinkedIn, a data-intensive company with over 300 million users.

Similar resources

Design and Test of the Real-time Text mining dashboard for Twitter

One of today's major research trends in the field of information systems is the discovery of implicit knowledge hidden in datasets that are currently being produced at high speed, in large volumes, and in a wide variety of formats. Data with such features is called big data. Extracting, processing, and visualizing this huge amount of data has become one of the concerns of data science scholar...

Intensional RDB Manifesto: a Unifying NewSQL Model for Flexible Big Data

In this paper we present a new family of Intensional RDBs (IRDBs) which extends traditional RDBs with Big Data and flexible 'Open schema' features, and is able to preserve user-defined relational database schemas and all preexisting user applications containing SQL statements for the deployment of such relational data. The standard RDB data is parsed into an internal vector key/v...

Design and Implementation of a Storage Repository Using Commonality Factoring

In this paper, we discuss the design of a data normalization system that we term commonality factoring. A real-world implementation of a storage system based upon data normalization requires design of the data normalization itself, of the storage repository for the data, and of the protocols to be used between applications performing data normalization and the server software of the repository....

Power-aware Proactive Storage-tiering Management for High-speed Tiered-storage Systems

Large-scale high-speed mass-storage systems account for a large part of the energy consumed at data centers. To conserve the energy consumed by these storage systems, we propose a high-speed tiered-storage system with a power-aware proactive method of storage-tiering management that minimizes loss of performance, which we call the energy-efficient High-speed Tiered-Storage system (eHiTS). eHi...
